1.  Introduction

This final report summarizes the research results obtained with support from ARO
for the grant "Concurrent Architectures for VLSI Signal and Image Processing"
(31080-EL) supported under grant number DA/DAAH04-94-G-0405 during the period
Sept. 30, 1994 - March 29, 1998. In this project we addressed high-speed and
low-power implementations of various recursive and adaptive digital filters,
finite field arithmetic and error control coders, and developed methodologies
for the design of folded or time-multiplexed architectures for multi-dimensional
and multirate systems.

Digital signal processing is the key technology in a large number of
applications such as multimedia, wireless and personal communications, gigabit
networks, and video processing including compression, storage, transmission and
retrieval. While most previous designs considered area-speed tradeoffs, current
designs must consider area-speed-power tradeoffs, since reduction of power
consumption is a key consideration in current systems. Reducing power
consumption increases the battery life in portable computers, personal digital
assistants, and communications devices, reduces cooling costs in workstations,
and increases system reliability, mean time to failure and yield. Reduction of
power consumption is also important for future scaled CMOS technologies.
Furthermore, the number of transistors per chip is expected to reach 200-500
million by the year 2010 as CMOS technology is scaled down to 0.07 micron. With
scaling of dimensions, the power density increases in a cubic manner with
constant supply voltage and remains constant with scaled supply voltage. Thus,
it is also important to develop signal processing algorithms that can be
operated with low supply voltage. However, reducing the supply voltage increases
the propagation delay. Thus, techniques such as pipelining and parallel
processing must be exploited to compensate for the reduced speed due to supply
voltage reduction. Power consumption can be reduced by optimizations at various
levels such as the system, algorithm, architecture, logic, circuit, layout, and
device levels. Other approaches to reducing power consumption include reduction
of capacitance (by strength reduction and reduction in the number of operations
in the algorithm), use of low-threshold devices, and reducing switching activity
in the system [7].
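
The tradeoff between supply voltage, propagation delay, and pipelining can be
illustrated with a small numerical sketch. It assumes the usual first-order
CMOS models, which are not stated in this report: dynamic power proportional to
C V^2 f, and gate delay proportional to C V / (V - Vt)^2. Under these
assumptions an M-level pipelined design charges 1/M of the critical-path
capacitance per clock period, so the supply can be lowered to beta*V0 while the
clock period is kept. The function names and the 5 V and 0.6 V values are
hypothetical and chosen only for illustration.

```python
import math


def pipelined_supply_scale(v0, vt, levels):
    """Return beta so that an M-level pipelined design running at beta*v0
    has the same clock period as the unpipelined design running at v0.

    Assumed delay model: delay ~ C * V / (V - vt)**2, with the critical-path
    capacitance reduced to C / levels by pipelining. Equating the two delays
    gives  levels * (beta*v0 - vt)**2 = beta * (v0 - vt)**2,
    which is a quadratic in beta.
    """
    a = levels * v0 ** 2
    b = -(2 * levels * v0 * vt + (v0 - vt) ** 2)
    c = levels * vt ** 2
    disc = b * b - 4 * a * c
    # The larger root keeps beta*v0 above the threshold voltage.
    return (-b + math.sqrt(disc)) / (2 * a)


def power_ratio(v0, vt, levels):
    """Power of the pipelined design relative to the original one.

    Assumed power model: P ~ C_total * V**2 * f, with total capacitance and
    clock frequency unchanged (pipelining latches are ignored).
    """
    beta = pipelined_supply_scale(v0, vt, levels)
    return beta ** 2


if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        beta = pipelined_supply_scale(5.0, 0.6, m)
        print(m, round(beta, 4), round(power_ratio(5.0, 0.6, m), 4))
```

With these illustrative values, three pipelining levels allow the supply to be
scaled to a little under half of its original value, which under the assumed
model cuts power to roughly one fifth at the same sample rate.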

Our research supported by this ARO grant has led to significant progress in the
design of high-speed or low-power implementations of recursive and adaptive
digital filters, finite field arithmetic and error control coders, and folded or
time-multiplexed architectures for multi-dimensional and multirate systems. Our
results in high-speed and low-power digital signal processing architecture and
algorithm design are described in detail in the following sections.

2.  Recursive  and  Adaptive  Digital  Filtering 

The early part of our work continued the research supported by the prior ARO
grant (27076-EL) on high-speed and low-power recursive filters and adaptive
equalizers [1][2][4][9][10].

Pipelined recursive lattice filters were developed for high-speed and low-power
implementations [2] [9]. Another class of recursive digital filters, referred to
as orthogonal digital filters, is useful for fixed-point implementations since
these can be implemented using Givens rotations only. However, these structures
cannot be easily pipelined. We applied the scattered look-ahead transformation
and polyphase decomposition to pipeline the orthogonal digital filters, which
can be implemented using CORDIC operations only [17].
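
The idea behind look-ahead pipelining can be sketched on the simplest possible
case, a first-order recursion y(n) = a y(n-1) + x(n). This is only an
illustration of the general principle and not the orthogonal-filter design of
[17]: iterating the recursion m times gives y(n) = a^m y(n-m) + sum_{i=0}^{m-1}
a^i x(n-i), so the feedback loop contains m delays instead of one and can be
pipelined by m levels. The function names and test signal below are
hypothetical.

```python
def direct_iir(x, a):
    """Reference first-order recursion y(n) = a*y(n-1) + x(n), zero initial state."""
    y = []
    prev = 0.0
    for sample in x:
        prev = a * prev + sample
        y.append(prev)
    return y


def lookahead_iir(x, a, m):
    """Same filter after m-step look-ahead.

    The feedback term now reaches back m samples (a**m * y(n-m)), and the
    extra non-recursive part sum_{i<m} a**i * x(n-i) is feed-forward, so it
    can be pipelined freely.
    """
    y = []
    a_m = a ** m
    for n in range(len(x)):
        feed_forward = sum((a ** i) * x[n - i] for i in range(m) if n - i >= 0)
        feedback = a_m * y[n - m] if n >= m else 0.0
        y.append(feed_forward + feedback)
    return y


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    print(direct_iir(signal, 0.5))
    print(lookahead_iir(signal, 0.5, 3))
```

Both functions produce the same output sequence; only the loop structure
differs, which is what allows the transformed filter to run at a higher clock
rate or at a lower supply voltage.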

In past work, we had developed the relaxed look-ahead transformation to pipeline
various least mean square (LMS) adaptive digital filters and equalizers [1] [4]
[10]. Recursive least squares (RLS) adaptive filters, however, provide superior
performance compared with standard LMS adaptive filters. The RLS structures are
typically not used in implementations due to their large hardware complexity.
However, with scaled technologies, chips are expected to house 100-500 million
devices by the year 2010, and RLS adaptive filters will be used in a large
number of applications such as beamforming, equalization and mobile phones. The
RLS adaptive filters make use of Givens rotations, and these filters cannot be
easily pipelined due to the presence of feedback loops in internal and boundary
cells. Pipelining can lead to higher speed at the same supply voltage, or to
lower power consumption at the same speed with reduced supply voltage. Thus, the
concurrency created using pipelining can be traded off for either higher speed
or lower power consumption. To alleviate the pipelining difficulty in Givens
rotation based RLS adaptive filters, a novel rotation referred to as Scaled
TAngent Rotation (STAR) was developed in our group. The tangent rotation is
scaled such that it is bounded between -1 and +1. The STAR rotation based RLS
adaptive filters can be easily pipelined and can be used in high-speed and
low-power applications. A lattice version of the RLS adaptive filter was
developed in [11] [16]. It was shown that this pipelined filter provides the
same performance as the Givens rotation based RLS adaptive filter in fixed-point
implementations. These topologies are expected to be used for equalization in
communications applications ranging from ATM to asymmetric or high-data-rate
digital subscriber loops (ADSL/HDSL) to wireless systems and mobile phones.
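
The feedback loop that makes Givens rotation based cells hard to pipeline can
be seen in a minimal sketch of the standard Givens rotation. This is the
textbook rotation and not the STAR rotation itself, whose exact definition is
given in the cited papers. The function names are hypothetical, and the
boundary-cell loop omits the forgetting factor used in RLS for simplicity.

```python
import math


def givens(a, b):
    """Return (c, s, r) such that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0, 0.0
    return a / r, b / r, r


def rotate(c, s, u, v):
    """Apply the rotation [[c, s], [-s, c]] to the pair (u, v)."""
    return c * u + s * v, -s * u + c * v


def boundary_cell(samples):
    """Simplified boundary-cell recursion (assumption: no forgetting factor).

    Each new stored value r depends on the previous r through a square root
    and a division, which is the feedback loop that limits pipelining.
    """
    r = 0.0
    for x in samples:
        _, _, r = givens(r, x)
    return r


if __name__ == "__main__":
    c, s, r = givens(3.0, 4.0)
    print(c, s, r)
    print(rotate(c, s, 3.0, 4.0))
    print(boundary_cell([3.0, 4.0, 12.0]))
```

Because the rotation parameters for sample n cannot be computed until the
stored value from sample n-1 is available, inserting pipeline latches inside
this loop changes the recursion unless the algorithm itself is transformed.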

3.  Design  Methodologies  and  Applications  to  Wavelets 

Digital signal processing systems can be designed by systematic transformation
techniques [3] [5] [6] [8] [15]. Most of the past work was devoted to the design
of one-dimensional single-rate DSP systems. We have extended the transformations
to accommodate multi-dimensional and multi-rate cases [24] [26]. These
transformations have been used to demonstrate the design of two-dimensional
wavelets. Wavelets have received significant attention in this decade. These are
used in various audio and video compression applications. To explore other
applications of wavelets, a robust denoising system based on wavelets was
considered [21], and using approximate computations a low-area version of the
denoising system was developed [19][14].

Techniques to design folded or time-multiplexed architectures for
two-dimensional wavelets were developed. While methodologies existed for folding
of one-dimensional wavelets, no approaches to folding of two-dimensional
wavelets were known. Design of two-dimensional wavelets required two important
extensions. The folding approach was extended from one-dimensional systems to
two-dimensional systems [26] and from single-rate to multi-rate in the context
of two-dimensional systems [25]. The resulting two-dimensional wavelets designed
using the proposed systematic approaches led to 25% savings in the number of
delay elements, which can result in considerable hardware savings in a video
compression system [25].

4.  VLSI  Finite  Field  Architectures  and  Error  Control  Coders 

Finite fields are used in many applications such as error control coding and
cryptography. Our research has been directed towards the design of low-latency
and low-power hardware-efficient architectures for finite field arithmetic
elements and Reed-Solomon coders. Our contributions include a low-area finite
field divider in dual basis [18], and low-area and low-power digit-serial finite
field multipliers and squarers in most and least significant bit first modes
[12] [13]. For low-power implementations, approaches to selection of a proper
primitive polynomial to reduce switching activity were also developed [22].
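
As background for the multipliers discussed above, the following sketch shows
multiplication in GF(2^m) in the simplest serial form: one bit of the
multiplier is processed per step, most significant bit first, with reduction by
the primitive polynomial after each shift. It corresponds to a digit size of
one and is not the digit-serial architecture of [12] [13]. The function name,
the field GF(2^4), and the polynomial x^4 + x + 1 are choices made only for
this illustration.

```python
def gf_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m) using polynomial basis, MSB-first.

    Elements are integers whose bits are polynomial coefficients.
    poly is the primitive polynomial including the x^m term,
    e.g. 0b10011 for x^4 + x + 1.
    """
    acc = 0
    for bit in range(m - 1, -1, -1):
        acc <<= 1                # multiply the partial result by x
        if acc & (1 << m):       # degree reached m: reduce modulo poly
            acc ^= poly
        if (b >> bit) & 1:       # add a if this multiplier bit is set
            acc ^= a
    return acc


if __name__ == "__main__":
    POLY = 0b10011  # x^4 + x + 1, primitive over GF(2)
    # x * x^3 = x^4, which reduces to x + 1
    print(bin(gf_mul(0b0010, 0b1000, POLY, 4)))
```

A digit-serial multiplier processes several multiplier bits per clock cycle
instead of one, which is the parameter traded against area and power in the
designs summarized here.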

For design of error control coders using programmable data paths, the finite
field multipliers were generalized to handle programmability with respect to
both the finite field element size and the primitive polynomial [25]. The power
consumption in these architectures was minimized by appropriate selection of
digit-size and pipelining levels. In many system implementations, reducing the
energy of a single arithmetic element is not important. On the other hand, it is
important to reduce the energy consumption of the entire algorithm. Based on a
hardware-software codesign approach, the best programmable data paths were
developed for Reed-Solomon coders [20][23]. For example, it was shown that
energy consumption in a 2-error correcting Reed-Solomon coder can be minimized
using a multiply-accumulate unit with digit-size 8 and a degree reduction unit
of digit-size 2 [23]. It was shown that separating the degree reduction part
from the multiply-accumulate part can lead to one-third energy reduction in a
finite-field vector-vector multiplication. Future efforts will be directed
towards scheduling strategies for low-power Reed-Solomon coders using more
complex decoding algorithms.

5.  Tutorial  Publications 

Support through this grant also led to a number of tutorial papers on VLSI
signal processing system design [3]-[7][8][15].